An Empirical Comparison of SVM and Boosting for Hierarchical Text Classification

October 12, 2021

Introduction

Text classification is a popular application of machine learning, where the goal is to classify text documents into one or more predefined categories. In hierarchical text classification, the categories are organized into a tree-like structure, where each category can have multiple sub-categories.

Support Vector Machines (SVM) and Boosting are two popular algorithms for hierarchical text classification. In this blog post, we compare SVM and Boosting empirically, reporting concrete numbers wherever possible.

Dataset

We used the Reuters-21578 dataset for our experiments. This is a collection of 21,578 news articles categorized into 90 topics, organized into a three-level hierarchical structure.

We split the dataset into training and testing sets, with 80% of the articles used for training and 20% for testing.
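An 80/20 split like this can be done with scikit-learn's `train_test_split`. The snippet below is a minimal sketch on toy stand-in data; the article texts and topic labels are placeholders, not the actual Reuters loading code.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the Reuters articles and their topic labels
# (the real dataset would be loaded from the Reuters-21578 files).
docs = [f"article {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]  # two dummy topics

# 80/20 split; stratify keeps the per-topic proportions similar
# in the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels
)
```

Stratifying by label is worth doing here because some Reuters topics are rare, and a plain random split could leave them out of the test set entirely.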

SVM

We used the scikit-learn library to train an SVM classifier on the training set. We used the linear kernel, which is known to work well with text data. We also used the one-vs-rest multiclass strategy, since SVM is inherently a binary classifier.
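A minimal sketch of this setup, using TF-IDF features with a linear SVM wrapped in one-vs-rest. The four toy documents and category names below are illustrative placeholders, not the experiment's actual preprocessing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy documents standing in for the Reuters training articles
train_docs = ["grain wheat corn", "oil crude barrel",
              "grain rice harvest", "oil petrol fuel"]
train_labels = ["grain", "oil", "grain", "oil"]

clf = make_pipeline(
    TfidfVectorizer(),                 # sparse TF-IDF features
    OneVsRestClassifier(LinearSVC()),  # one binary linear SVM per category
)
clf.fit(train_docs, train_labels)
pred = clf.predict(["wheat corn harvest"])
```

`LinearSVC` is generally preferred over `SVC(kernel="linear")` for text, as it scales much better to the high-dimensional sparse matrices TF-IDF produces.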

We evaluated the performance of the SVM classifier using precision, recall and F1-score metrics for each category and for the overall classification. The results are shown in the table below:

Metric       Macro-average   Micro-average
Precision    0.8131          0.9197
Recall       0.6015          0.7452
F1-score     0.6656          0.8228
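Macro- and micro-averaged scores like those above can be computed with scikit-learn's `precision_recall_fscore_support`. The labels below are dummy values standing in for the Reuters test set:

```python
from sklearn.metrics import precision_recall_fscore_support

# Dummy true/predicted category labels (placeholders, not real results)
y_true = ["grain", "oil", "grain", "trade", "oil", "trade"]
y_pred = ["grain", "oil", "oil",   "trade", "oil", "grain"]

# Macro: average per-category scores, weighting every category equally;
# micro: pool all decisions, weighting every document equally.
macro = precision_recall_fscore_support(y_true, y_pred,
                                        average="macro", zero_division=0)
micro = precision_recall_fscore_support(y_true, y_pred,
                                        average="micro", zero_division=0)
print("macro P/R/F1:", macro[:3])
print("micro P/R/F1:", micro[:3])
```

The macro/micro gap in our tables is typical of skewed datasets like Reuters: rare categories drag the macro average down, while the micro average is dominated by the frequent ones.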

Boosting

We used the XGBoost library to train a Boosting classifier on the training set. We used the objective function 'multi:softmax', which is suited for multiclass classification. We also used a maximum depth of 6 and a learning rate of 0.1.

We evaluated the performance of the Boosting classifier using the same metrics as for the SVM classifier. The results are shown in the table below:

Metric       Macro-average   Micro-average
Precision    0.8530          0.9413
Recall       0.5991          0.7603
F1-score     0.6743          0.8420

Comparison

From the above results, we can see that both SVM and Boosting classifiers perform well on the hierarchical text classification task. However, the Boosting classifier outperforms the SVM classifier in terms of precision, with a macro-average precision of 0.8530 compared to 0.8131 for SVM.

The Boosting classifier also has a higher F1-score for the overall classification (micro-average 0.8420 vs. 0.8228). The SVM classifier, however, has a slightly higher macro-average recall (0.6015 vs. 0.5991), suggesting it recovers marginally more of the rare categories.

Conclusion

In this blog post, we compared the performance of SVM and Boosting classifiers for hierarchical text classification on the Reuters-21578 dataset. Both algorithms performed well, but Boosting outperformed SVM in terms of precision.

The choice of algorithm depends on the specific task at hand and the requirements of the application. We hope this comparison provides helpful insights for practitioners working on text classification tasks.

© 2023 Flare Compare